SCRec: A Scalable Computational Storage System with Statistical Sharding and Tensor-train Decomposition for Recommendation Models
Yang, Jinho, Kim, Ji-Hoon, Kim, Joo-Young
Abstract -- Deep Learning Recommendation Models (DLRMs) play a crucial role in delivering personalized content across web applications such as social networking and video streaming. However, with improvements in performance, the parameter size of DLRMs has grown to terabyte (TB) scales, accompanied by memory bandwidth demands exceeding TB/s levels. Furthermore, the workload intensity within the model varies based on the target mechanism, making it difficult to build an optimized recommendation system. In this paper, we propose SCRec, a scalable computational storage recommendation system that can handle TB-scale industrial DLRMs while guaranteeing high bandwidth requirements. SCRec utilizes a software framework that features a mixed-integer programming (MIP)-based cost model, efficiently fetching data based on data access patterns and adaptively configuring memory-centric and compute-centric cores. Additionally, SCRec integrates hardware acceleration cores to enhance DLRM computations, particularly allowing for the high-performance reconstruction of approximated embedding vectors from the extremely compressed tensor-train (TT) format. By combining its software framework and hardware accelerators, while eliminating data communication overhead through its single-server implementation, SCRec achieves substantial improvements in DLRM inference performance. It delivers up to 55.77x speedup compared to a CPU-DRAM system with no loss in accuracy and up to 13.35x energy efficiency gains over a multi-GPU system.
INTRODUCTION
Recommendation systems are widely used in social network services and video streaming platforms to provide personalized and preferred content to consumers, as described in Fig. 1.
They are also employed in search engines to offer differentiated search services [1]-[5]. For example, more than 80% of Meta's data center resources are allocated to recommendation system inference, while over 50% are utilized for training these systems [6]. Traditional recommendation systems relied on collaborative filtering techniques, such as content filtering using matrix factorization [7]-[10]. However, with advancements in deep neural networks (DNNs), deep learning recommendation models (DLRMs) that combine embedding tables (EMBs) and DNNs have emerged. This combination has demonstrated superior recommendation performance, making DLRM the industry standard in recommendation systems. These models are widely adopted in data centers, with recent focuses on both software-level and hardware-level optimizations [11]-[17]. This work was supported by Samsung Electronics Co., Ltd.
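SCRec's key compression idea, reconstructing embedding rows on the fly from TT cores, can be sketched in NumPy. All shapes, ranks, and the three-core split below are hypothetical illustrations, not SCRec's actual configuration:

```python
import numpy as np

# Hypothetical embedding table with N = 8*8*8 = 512 rows and D = 4*4*4 = 64
# columns, stored as three TT cores G_k of shape (r_{k-1}, n_k, d_k, r_k),
# with boundary ranks r_0 = r_3 = 1.
n, d, r = (8, 8, 8), (4, 4, 4), (1, 16, 16, 1)
rng = np.random.default_rng(0)
cores = [rng.standard_normal((r[k], n[k], d[k], r[k + 1])) * 0.1
         for k in range(3)]

def tt_embedding_lookup(row):
    """Reconstruct one embedding row by contracting one index slice per core."""
    # Decompose the flat row index into per-core indices (i1, i2, i3).
    idx = []
    for nk in reversed(n):
        idx.append(row % nk)
        row //= nk
    idx.reverse()
    # Each slice G_k[:, i_k, :, :] has shape (r_{k-1}, d_k, r_k);
    # chain-contract the shared rank dimension, keeping the d axes.
    out = cores[0][:, idx[0], :, :]              # (1, d1, r1)
    for k in (1, 2):
        slice_k = cores[k][:, idx[k], :, :]      # (r_{k-1}, d_k, r_k)
        out = np.einsum('adb,bec->adec', out, slice_k)
        out = out.reshape(1, -1, out.shape[-1])  # merge d axes, keep rank axis
    return out.reshape(-1)                       # length d1*d2*d3 = 64

vec = tt_embedding_lookup(123)
```

With these toy shapes the full 512x64 table holds 32,768 values while the three cores total 9,216, and a lookup touches only one index slice per core; the actual compression SCRec reports comes from far larger tables and tuned ranks.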
Analog Bayesian neural networks are insensitive to the shape of the weight distribution
Patel, Ravi G., Xiao, T. Patrick, Agarwal, Sapan, Bennett, Christopher
Recent work has demonstrated that Bayesian neural networks (BNNs) trained with mean-field variational inference (MFVI) can be implemented in analog hardware, promising orders-of-magnitude energy savings compared to standard digital implementations. However, while Gaussians are typically used as the variational distribution in MFVI, it is difficult to precisely control the shape of the noise distributions produced by sampling analog devices. This paper introduces a method for MFVI training that uses real device noise as the variational distribution. Furthermore, we demonstrate empirically that the predictive distributions of BNNs with the same weight means and variances converge to the same distribution, regardless of the shape of the variational distribution. This result suggests that analog device designers do not need to consider the shape of the device noise distribution when implementing BNNs that perform MFVI in hardware.
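The shape-insensitivity claim can be probed with a toy sketch: draw weights by reparameterization with two different zero-mean, unit-variance noise distributions and compare the resulting Monte Carlo predictions. The one-layer model, the parameter values, and the uniform surrogate for "device noise" below are assumptions of this sketch, and it checks only the first moment of the predictive distribution, not the full convergence result of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy one-layer "BNN": y = x @ w, with w drawn around per-weight means and
# standard deviations (both hypothetical here, standing in for MFVI output).
mu = rng.standard_normal((4, 1))
sigma = 0.1 * np.ones((4, 1))
x = rng.standard_normal((256, 4))

def predictive_mean(noise_sampler, n_samples=5000):
    """Monte Carlo predictive mean, with an arbitrary zero-mean,
    unit-variance noise distribution in the reparameterization step."""
    total = np.zeros((x.shape[0], 1))
    for _ in range(n_samples):
        w = mu + sigma * noise_sampler(mu.shape)  # reparameterized weight draw
        total += x @ w
    return total / n_samples

gaussian = predictive_mean(lambda s: rng.standard_normal(s))
# Uniform noise rescaled to unit variance: Var(U[-a, a]) = a^2 / 3.
uniform = predictive_mean(lambda s: rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), s))
```

Up to Monte Carlo error, both predictive means agree with each other (and with the deterministic x @ mu), mirroring the paper's observation that only the first two moments of the weight noise matter.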
PIFS-Rec: Process-In-Fabric-Switch for Large-Scale Recommendation System Inferences
Huo, Pingyi, Devulapally, Anusha, Maruf, Hasan Al, Park, Minseo, Nair, Krishnakumar, Arunachalam, Meena, Akbulut, Gulsum Gudukbay, Kandemir, Mahmut Taylan, Narayanan, Vijaykrishnan
Deep Learning Recommendation Models (DLRMs) have become increasingly popular and prevalent in today's datacenters, consuming most of the AI inference cycles. The performance of DLRMs is heavily influenced by available bandwidth due to their large vector sizes in embedding tables and concurrent accesses. To achieve substantial improvements over existing solutions, novel approaches towards DLRM optimization are needed, especially in the context of emerging interconnect technologies like CXL. This study delves into exploring CXL-enabled systems, implementing a process-in-fabric-switch (PIFS) solution to accelerate DLRMs while optimizing their memory and bandwidth scalability. We present an in-depth characterization of industry-scale DLRM workloads running on CXL-ready systems, identifying the predominant bottlenecks in existing CXL systems. We therefore propose PIFS-Rec, a PIFS-based scheme that implements near-data processing through downstream ports of the fabric switch. PIFS-Rec achieves a latency that is 3.89x lower than Pond, an industry-standard CXL-based system, and also outperforms BEACON, a state-of-the-art scheme, by 2.03x.
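The bandwidth motivation behind near-data schemes like PIFS-Rec can be made concrete with back-of-envelope arithmetic. The pooling factor P and vector width D below are hypothetical, not taken from the paper; the point is only that reducing embedding rows before they cross the interconnect shrinks link traffic by the pooling factor:

```python
# Per-query link traffic for one embedding-bag lookup, assuming
# (hypothetically) P pooled rows per table and D-dim fp32 vectors.
P, D, FP32 = 80, 128, 4
host_side_gather = P * D * FP32  # ship every looked-up row to the host
near_data_pool = 1 * D * FP32    # sum the rows at the device, ship one vector

reduction = host_side_gather // near_data_pool  # traffic reduction factor
```

Under these assumed parameters the pooled-near-data path moves 80x fewer bytes per query; real gains depend on pooling factors, vector widths, and how much of the access stream the fabric-switch ports can absorb.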
MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models
Kim, Taehyun, Choi, Kwanseok, Cho, Youngmock, Cho, Jaehoon, Lee, Hyuk-Jae, Sim, Jaewoong
Mixture-of-Experts (MoE) large language models (LLMs) have memory requirements that often exceed the GPU memory capacity, requiring costly parameter movement from secondary memories to the GPU for expert computation. In this work, we present Mixture of Near-Data Experts (MoNDE), a near-data computing solution that efficiently enables MoE LLM inference. MoNDE reduces the volume of MoE parameter movement by transferring only the $\textit{hot}$ experts to the GPU, while computing the remaining $\textit{cold}$ experts inside the host memory device. By replacing the transfers of massive expert parameters with transfers of small activations, MoNDE enables far more communication-efficient MoE inference, resulting in substantial speedups over existing parameter-offloading frameworks for both encoder and decoder operations.
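The hot/cold split can be sketched in NumPy. The expert sizes, the hot set, and the byte accounting below are illustrative assumptions, not MoNDE's measured configuration; the point is that cold experts exchange only activations while their (much larger) weights stay put:

```python
import numpy as np

rng = np.random.default_rng(2)
n_experts, d_model, d_ff = 8, 64, 256
# Hypothetical expert weights; in a real MoE LLM these dominate the memory.
experts = [rng.standard_normal((d_model, d_ff)) * 0.05
           for _ in range(n_experts)]
hot = {0, 1}  # experts assumed resident on the GPU (most frequently routed)

def moe_forward(tokens, assign):
    """Apply each token's assigned expert; cold experts run 'in place'
    (standing in for near-data execution), so their weights never move."""
    out = np.zeros((len(tokens), d_ff))
    for e in np.unique(assign):
        mask = assign == e
        out[mask] = tokens[mask] @ experts[int(e)]
    return out

def interconnect_bytes(assign, fp_bytes=2):
    """Link traffic if cold experts run near data (activations move)
    versus naively fetching their weights to the GPU."""
    cold = [int(e) for e in np.unique(assign) if int(e) not in hot]
    n_cold_tok = int(np.isin(assign, cold).sum())
    acts = n_cold_tok * (d_model + d_ff) * fp_bytes  # in + out activations
    weights = len(cold) * d_model * d_ff * fp_bytes  # cold expert params
    return acts, weights

tokens = rng.standard_normal((32, d_model))
assign = rng.integers(0, n_experts, 32)
y = moe_forward(tokens, assign)
acts, weights = interconnect_bytes(assign)
```

Even at these toy sizes the activation traffic is smaller than the cold experts' weight traffic; at LLM scale, where each expert is millions of parameters, the gap widens by orders of magnitude.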
Efficient and Economic Large Language Model Inference with Attention Offloading
Chen, Shaoyuan, Lin, Yutong, Zhang, Mingxing, Wu, Yongwei
Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. This mismatch arises from the autoregressive nature of LLMs, where the generation phase comprises operators with varying resource demands. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially as context length increases. To enhance the efficiency and cost-effectiveness of LLM serving, we introduce the concept of attention offloading. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end accelerators for other parts of the model. This heterogeneous setup ensures that each component is tailored to its specific workload, maximizing overall performance and cost efficiency. Our comprehensive analysis and experiments confirm the viability of splitting the attention computation over multiple devices. Also, the communication bandwidth required between heterogeneous devices proves to be manageable with prevalent networking technologies. To further validate our theory, we develop Lamina, an LLM inference system that incorporates attention offloading. Experimental results indicate that Lamina can provide 1.48x-12.1x higher estimated throughput per dollar than homogeneous solutions.
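A minimal single-query, single-head sketch of attention offloading, assuming a hypothetical 4096-token fp16 KV cache; Lamina's actual system batches many requests across multiple memory-optimized devices, which this does not model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_on_memory_device(q, k_cache, v_cache):
    """Runs where the KV cache lives; only q (inbound) and one output
    vector (outbound) ever cross the interconnect, never the cache."""
    scores = softmax(q @ k_cache.T / np.sqrt(q.shape[-1]))
    return scores @ v_cache

# Hypothetical decode step: 4096-token context, head dimension 128, fp16.
rng = np.random.default_rng(3)
T, D, FP16 = 4096, 128, 2
k_cache = rng.standard_normal((T, D))
v_cache = rng.standard_normal((T, D))
q = rng.standard_normal((1, D))
out = attention_on_memory_device(q, k_cache, v_cache)

link_bytes = (q.size + out.size) * FP16  # activations crossing the link
cache_bytes = 2 * T * D * FP16           # K and V if shipped instead
```

With these assumed sizes, keeping attention next to the cache moves thousands of times fewer bytes per decode step than shipping K and V to the accelerator, which is why the required inter-device bandwidth stays within reach of commodity networking.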
The Combination of Metal Oxides as Oxide Layers for RRAM and Artificial Intelligence
Resistive random-access memory (RRAM) is a promising candidate for next-generation memory devices due to its high speed, low power consumption, and excellent scalability. Metal oxides are commonly used as the oxide layer in RRAM devices due to their high dielectric constant and stability. However, to further improve the performance of RRAM devices, recent research has focused on integrating artificial intelligence (AI). AI can be used to optimize the performance of RRAM devices, while RRAM can also power AI as a hardware accelerator and in neuromorphic computing. This review paper provides an overview of the combination of metal-oxide-based RRAM and AI, highlighting recent advances in these two directions. We discuss the use of AI to improve the performance of RRAM devices and the use of RRAM to power AI. Additionally, we address key challenges in the field and provide insights into future research directions.
Neuromorphic memory device simulates neurons and synapses: Simultaneous emulation of neuronal and synaptic properties promotes the development of brain-like artificial intelligence
Neuromorphic computing aims to realize artificial intelligence (AI) by mimicking the mechanisms of the neurons and synapses that make up the human brain. Inspired by cognitive functions of the human brain that current computers cannot provide, neuromorphic devices have been widely investigated. However, current Complementary Metal-Oxide-Semiconductor (CMOS)-based neuromorphic circuits simply connect artificial neurons and synapses without synergistic interactions, and the concomitant implementation of neurons and synapses remains a challenge. To address these issues, a research team led by Professor Keon Jae Lee from the Department of Materials Science and Engineering implemented the biological working mechanisms of humans by introducing neuron-synapse interactions in a single memory cell, rather than the conventional approach of electrically connecting artificial neuronal and synaptic devices. The artificial synaptic devices studied previously were often used to accelerate parallel computations, similar to commercial graphics cards, which differs clearly from the operational mechanisms of the human brain.
Memory Association Networks
Kim, Seokjun, Jang, Jaeeun, Jang, Yeonju, Choi, Seongyune, Kim, Hyeoncheol
Various networks have been designed in the deep learning field to date. Typically, images, sounds, text, and hierarchical and relational data are learned through these networks, and inductive learning is performed. However, these networks are limited to specific datasets or specific tasks. We therefore designed artificial association networks that, like humans, can learn various datasets simultaneously in a single network. In a second study, we propose deductive association networks to perform deductive reasoning.